LEARNING AND DIAGNOSING FAULTS 
USING NEURAL NETWORKS 


N96- 70675 


Bruce A. Whitehead 
Earl L. Kiech 
Moonis Ali 

Center For Advanced Space Propulsion 
The University Of Tennessee Space Institute 
Tullahoma, TN 37388 


Abstract 

Neural networks have been employed for learn- 
ing fault behavior from rocket engine simulator pa- 
rameters and for diagnosing faults on the basis of the 
learned behavior. Two problems in applying neural 
networks to learning and diagnosing faults are (1) the 
complexity of the sensor data to fault mapping to be 
modeled by the neural network, which implies dif- 
ficult and lengthy training procedures; and (2) the 
lack of sufficient training data to adequately repre- 
sent the very large number of different types of faults 
which might occur. Methods are derived and tested 
in an architecture which addresses these two prob- 
lems. First, the sensor data to fault mapping is de- 
composed into three simpler mappings which perform 
sensor data compression, hypothesis generation, and 
sensor fusion. Efficient training is performed for each 
mapping separately. Secondly, the neural network 
which performs sensor fusion is structured to detect 
new unknown faults for which training examples were 
not presented during training. These methods were 
tested on a task of fault diagnosis by employing rocket 
engine simulator data. Results indicate that the de- 
composed neural network architecture can be trained 
efficiently, can identify faults for which it has been 
trained, and can detect the occurrence of faults for 
which it has not been trained. 

Introduction 

The objective of our research described in this 
paper is to employ neural networks for (i) learning 
fault behavior from rocket engine simulator parame- 
ters perturbed with noise (termed as sensor data in 
this paper), and (ii) diagnosing faults on the basis 
of the learned behavior. In a complex system (such 
as a liquid-fuel rocket engine), there are many pos- 
sible ways in which components of the system may 
fail. Only a fraction of these possible failures have 
been observed and are known to human experts. Hu- 
man experts have not seen all possible instances of all 
faults and hence cannot describe the features of the 
faults sufficiently well to make diagnostic decisions.

This state of affairs is problematic for training a 
neural network to recognize component failures based 
on sensor data. Neural networks learn to recognize 
faults by being trained with examples of these faults. 
They are capable of generalizing from some examples 
of a fault to other examples of the same fault, but they 
are not capable of recognizing a new fault for which 
no training examples have been given. If a neural 
network is trained to recognize a set of faults and then 
presented with an example of a completely new fault, 
it will typically either (i) find the closest match to 
the new example among the previously trained faults, 
or (ii) classify the new example as an interpolative 
“blend” of previously trained faults. 

Neither of these classification strategies is appro- 
priate for recognizing a failure which is different from 
the classes of failures for which the neural network has 
been trained. What is needed, rather, is an ability to 
recognize that the new failure is not an example of 
any previously trained fault. In other words, the neu- 
ral network must be capable of recognizing the classes 
for which it has been trained as well as one additional 
“unknown” class, even though no training examples 
are available for this unknown class. 

If the weights in a neural network are determined 
solely by the training examples, then the subsequent 
behavior of the network is also determined by these 
training examples. Such a network would only be able 
to classify new examples on the basis of the examples 
it has seen, and would not be expected to reliably 
recognize an “unknown” class. We therefore have de- 
veloped a neural network architecture in which the 
weights are not determined solely by the training ex- 
amples. Instead, the weights are determined partly 
by expert judgment about the type of classification 
to be performed, and partly by conventional back- 
propagation training from examples. 

This architecture has been tested in the task of 
sensor fusion of data from the rocket engine simula- 
tor. The purpose of the sensor fusion architecture is 
to classify sensor readings as either (i) examples
of normal steady-state operation, (ii) examples of
known classes of component failures, or (iii) examples
of an unknown class of anomalous behavior.

Sensor data typical of a rocket engine fault 
(with and without the addition of simulated noise) are 
depicted in Fig. 1. Four sensors, high pressure fuel 
turbopump (HPFT) temperature, thrust, chamber 
coolant valve (CCV) pressure, and main fuel valve 
(MFV) pressure are monitored during engine opera- 
tion. The data are normalized with steady-state 
values for convenience. In the present effort, it is 
assumed that the engine is operating at some steady- 
state condition when the fault condition occurs. The 
fault will be manifested as a deviation of sensor values 
from the steady-state condition. In the present effort, 
a time window containing 40 sensor readings span- 
ning four seconds is used. 

It is not sufficient for the purposes of diagnosis
to simply detect when and whether a deviation from 
steady-state conditions has occurred; how the devia- 
tion is manifested over time is also important. For 
instance, an observation that a particular sensor 
parameter is decreasing linearly will likely result in a 
different diagnosis than that obtained from observing 
an asymptotic decrease [1]. Therefore, to be effective, a
diagnostic system must be responsive to the qualita- 
tive (as well as the quantitative) behavior of the en- 
gine. The diagnostic process must also exhibit 
resilience to noise. A noise-corrupted version of the 
fault is therefore depicted in Fig. 1 for comparison.
The 2% noise level means that the standard deviation 
for a Gaussian distribution of perturbations about the 
noise-free curves is 0.02. A 2% noise level would be 
considered excessive for most instrumentation; how- 
ever, satisfactory operation at this noise level is used 
as a goal in the present effort. 
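
The noise model described above can be sketched as follows. This is only
an illustration under stated assumptions: the ramp-shaped fault curve, the
function names, and the fixed random seed are hypothetical and do not come
from the rocket engine simulator. Only the window size (40 readings over
four seconds) and the noise level (standard deviation 0.02 on data
normalized to steady state) are taken from the text.

```python
import random

WINDOW_SAMPLES = 40      # sensor readings per time window (from the text)
WINDOW_SECONDS = 4.0     # duration of the window (from the text)
NOISE_SIGMA = 0.02       # "2% noise": std. dev. on normalized data


def ramp_fault_curve(severity, onset_start, onset_interval):
    """Hypothetical normalized curve: 1.0 at steady state, then a linear
    drop of `severity` spread over `onset_interval` seconds."""
    dt = WINDOW_SECONDS / WINDOW_SAMPLES
    curve = []
    for k in range(WINDOW_SAMPLES):
        progress = (k * dt - onset_start) / onset_interval
        progress = min(1.0, max(0.0, progress))   # clamp to [0, 1]
        curve.append(1.0 - severity * progress)
    return curve


def add_gaussian_noise(curve, sigma=NOISE_SIGMA, rng=None):
    """Perturb every sample with zero-mean Gaussian noise."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in curve]


clean = ramp_fault_curve(severity=0.3, onset_start=1.0, onset_interval=1.5)
noisy = add_gaussian_noise(clean, rng=random.Random(0))
```

The clean curve stays at 1.0 until the assumed onset and settles at 0.7
once the 30% deviation is fully developed; the noisy copy scatters around
it with the stated standard deviation.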

Neural networks have been employed at UTSI
in the past to diagnose the development of fault conditions
in jet engines [2-4], using conventional feedforward
networks trained with well-known
back-propagation algorithms [6]. Although this method
was effective in diagnosing faults when given samples 
of data corresponding to a fault for which the networks 
were trained, several deficiencies of the conventional 
feedforward model were noted: 

1). Networks can be trained to associate a set
of patterns with a set of fault conditions quite readily.
However, when presented with an input pattern
qualitatively different from those included in the
training examples (e.g., a pattern representing a previously
unobserved fault condition), the networks can
produce spurious categorizations and false positive
identifications. There is no method to train a conventional
feedforward network to, for instance, activate
an output node if the input pattern is not similar to
those patterns for which the network is trained.

FIGURE 1. Sensor outputs for a 30% blockage of the
main fuel valve with 1.5 second onset interval.

2). Conventional feedforward networks are 
limited in the number and type of training examples 
which can be used. Some of the networks in the jet 
engine diagnostic system required long training 
times, which depended not only on the number of 
training examples required, but also on how similar 
training examples representing different faults were 
to each other. 

These deficiencies can adversely affect the
application of neural networks to the classification of 
fault patterns. The present work attempts to formu- 
late a neural network architecture and training 
regimen which will successfully extend the 
capabilities of feedforward network models. 

Architecture and Method

Consider the problem of inferring the operating
conditions of a system from sensor data. Suppose
there are M different possible system conditions
$C_1, \ldots, C_M$ and N different sensors $S_1, \ldots, S_N$. In general,
each possible condition $C_i$ may be quantitatively characterized
by a set of P parameters $c_{ip}$, $p = 1, 2, \ldots, P$. (In
our current research, each condition is characterized
by two parameters, relating to the severity and onset
interval of faults in jet and rocket engines.) The set of
all possible system conditions can then be characterized
by the set of parameters $\{c_{ip}\}$, where i
enumerates the possible conditions and p enumerates
the parameters associated with each condition. To
infer these conditions, we may sample the value of
each sensor $S_j$ over a period of time. If each sensor is
sampled at discrete times $t = 1, \ldots, T$, then the complete
set of sensor data is represented by the set of values
$\{s_{jt}\}$, where j enumerates the different sensors and t
enumerates the sampling times.

From the point of view of its operation, a
system can be represented as a function

$$F : \{c_{ip}\} \rightarrow \{s_{jt}\}$$

which represents the cause-and-effect relationship
from system conditions to sensor values. Ideally, if all
conditions other than those enumerated could be held
constant, F would be a deterministic function. In practice,
however, F is usually characterized as a stochastic
function in which values of $s_{jt}$ are assumed to be
influenced by noise as well as by the conditions $\{c_{ip}\}$.
This noise results both from variability in testing
conditions and from unreliability of the sensors.
Sample data points for this function F can be derived
by observing the physical system (under known test
conditions) and/or by manipulating a simulation of the
physical system.

In either case, the problem of inferring the
condition of the system from observed sensor values
is the problem of deriving the inverse function

$$F^{-1} : \{s_{jt}\} \rightarrow \{c_{ip}\}$$

which is the mapping from sensor values to
hypothesized underlying conditions. For complex
real-life systems such as jet and rocket engines, the
nature of the function F is unknown. However,
observations of sensor values caused by known conditions
(i.e., observations of F) can be used as example
data points for $F^{-1}$. It might be thought that $F^{-1}$ could
readily be learned from these examples by a neural
network trained with back-propagation. Each observed
set of sensor values $\{s_{jt}\}$ would be input to the
network, and the known conditions $\{c_{ip}\}$ would be the
desired outputs for back-propagation training. In trials
using jet and rocket engine data, however, we
observed that two different fault conditions may cause
very similar sensor behavior and also that low-severity,
slowly developing faults are very difficult to
distinguish from normal behavior. Such differential
diagnoses have proven very difficult for an unstructured
back-propagation algorithm to learn.

We propose that learning of the function $F^{-1}$
from sample data points can be more efficient if $F^{-1}$
is decomposed and structured into special-purpose
constituent functions, each of which can be learned
separately. The proposed decomposition is

$$F^{-1} = R * M * A$$

(where * denotes the composition of functions, applied
in the order R, then M, then A).

The purpose of the function R is data compression:
to reduce the dimensionality of the sensor data
$\{s_{jt}\}$ without losing the information necessary to make
valid inferences. The purpose of the function M is to
map this reduced sensor data $\{s'_{jh}\}$ into hypothesized
conditions $\{c'_{ip}\}$. The functions R and M are applied
separately to each sensor j. Finally, the purpose of the
function A is to perform sensor fusion, i.e., to arbitrate
among the hypotheses nominated by different sensors
to determine which hypothesis is most likely. Each
special-purpose component function R, M, and A is
represented independently and trained separately.
While the sensor fusion architecture A is the focus of
the present study, the preprocessing functions R and
M are discussed briefly in the following two subsections.
The alternative sensor fusion architectures that
were compared are then discussed in detail, and the
results of the comparison are given.
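
The decomposition can be pictured as a three-stage pipeline. The sketch
below is not the authors' implementation: the function name `diagnose`,
the type aliases, and the toy stand-ins for R, M, and A are assumptions
made purely to show the order of application, with R and M applied per
sensor and A applied once across all sensors.

```python
from typing import Callable, Dict, List, Sequence

Curve = Sequence[float]          # time series for one sensor
Compressed = List[float]         # reduced representation of that series
Hypotheses = Dict[str, float]    # condition name -> evidence score


def diagnose(
    curves: Sequence[Curve],
    compress: Sequence[Callable[[Curve], Compressed]],          # R, per sensor
    hypothesize: Sequence[Callable[[Compressed], Hypotheses]],  # M, per sensor
    fuse: Callable[[List[Hypotheses]], str],                    # A, all sensors
) -> str:
    """Apply R, then M, to each sensor separately; then fuse with A."""
    per_sensor = [m(r(c)) for c, r, m in zip(curves, compress, hypothesize)]
    return fuse(per_sensor)


# Toy stand-ins (hypothetical): compress to the mean, score two hypotheses,
# and pick the hypothesis with the largest summed evidence.
def mean(curve):
    return [sum(curve) / len(curve)]


def vote(z):
    return {"fault": 1.0 - z[0], "normal": z[0]}


def pick(hypotheses):
    totals = {}
    for h in hypotheses:
        for name, score in h.items():
            totals[name] = totals.get(name, 0.0) + score
    return max(totals, key=totals.get)


result = diagnose([[1.0, 0.6], [1.0, 0.8]], [mean, mean], [vote, vote], pick)
```

Keeping the three stages behind separate callables mirrors the point made
above: each stage can be trained or replaced without touching the others.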

Step 1: Data Compression 

The data compression function R above is
performed by an autoassociative network [6] with at
least one layer of semi-linear hidden nodes. There are
T input nodes and T output nodes, one for each discrete
sampling time $t = 1, \ldots, T$. Data compression is
accomplished by connecting the T input nodes to the
T output nodes through a much smaller set of
$H \ll T$ hidden nodes. Node activations are continuous
variables. The network is trained to associate input
patterns with identical patterns clamped at the output
nodes. As a result, training is unsupervised; it is
left to the weights between the input, hidden, and
output layers of the network to organize in a fashion
whereby the mapping is performed correctly.

Once training has been completed, classification
is accomplished by examining the activations of
the hidden nodes. In a sense, the hidden node activations
are employed as the "output" of the network. The
mapping from input nodes to hidden nodes reduces
the dimensionality of the input data from T (the number
of time-series samples for a given sensor) to H (the
number of hidden node activations). The mapping
from hidden nodes to output nodes, on the other hand,
has been used to reproduce the original input pattern
from the hidden node activations. To the extent that
training is successful, therefore, the H hidden node
activations will contain sufficient information to
reproduce the T time-series sensor values. The H
hidden node activations for each sensor $S_j$ thus compress
that sensor's time-series values $\{s_{jt}\}$, $t = 1, \ldots, T$,
into its hidden node activations $(s'_{j1}, \ldots, s'_{jH}) = R(\{s_{jt}\})$,
where R is the weighted-summation performed by the
connections from the input layer to the hidden layer
of the autoassociative network.

This special-purpose network for reducing the 
dimensionality of the sensor data can be trained very 
economically. Suppose that it is desired to train the 
overall architecture R * M * A to classify a large num- 
ber of temporal patterns (where each temporal pat- 
tern consists of a set of data curves, i.e. a set of 
time-series values for all sensors). While all of these 
curves will be classified by the mapping M of step 2 
below, only a small subset of the training curves is 
needed to train the data compression network R. For 
training an autoassociative network to perform R, a 
small representative subset of the total set of curves 
is sufficient. Training is done on this subset of curves 
using an autoassociative back-propagation algorithm 
with output nodes clamped to the same values as the 
input nodes. After training is complete, data compression
is accomplished by the resulting set of weights
from the T input nodes to the H hidden nodes, as 
explained above. 
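
A minimal sketch of such an autoassociative network is given below, in
plain Python so that it is self-contained. Several details are assumptions
rather than the configuration used in this work: logistic hidden units,
linear output units, per-example gradient updates, the learning rate, and
the small ramp-shaped training curves (expressed as deviations from steady
state). The returned `compress` function plays the role of R: it maps a
curve of T samples to its H hidden-node activations.

```python
import math
import random


def train_autoassociator(curves, n_hidden, epochs=500, lr=0.05, seed=0):
    """Train a T -> H -> T network to reproduce its own input.

    Returns (compress, mse_before, mse_after)."""
    rng = random.Random(seed)
    T = len(curves[0])
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(T)] for _ in range(n_hidden)]
    b_hid = [0.0] * n_hidden
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(T)]
    b_out = [0.0] * T

    def forward(x):
        hid = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
               for row, b in zip(w_in, b_hid)]
        out = [sum(w * h for w, h in zip(row, hid)) + b
               for row, b in zip(w_out, b_out)]
        return hid, out

    def mse():
        total = sum((o - xi) ** 2 for x in curves
                    for o, xi in zip(forward(x)[1], x))
        return total / (len(curves) * T)

    mse_before = mse()
    for _ in range(epochs):
        for x in curves:
            hid, out = forward(x)
            err = [o - xi for o, xi in zip(out, x)]   # target equals input
            # Back-propagate through the output weights before updating them.
            d_hid = [hid[k] * (1.0 - hid[k])
                     * sum(err[t] * w_out[t][k] for t in range(T))
                     for k in range(n_hidden)]
            for t in range(T):
                for k in range(n_hidden):
                    w_out[t][k] -= lr * err[t] * hid[k]
                b_out[t] -= lr * err[t]
            for k in range(n_hidden):
                for t in range(T):
                    w_in[k][t] -= lr * d_hid[k] * x[t]
                b_hid[k] -= lr * d_hid[k]

    def compress(curve):
        """R: hidden-node activations for one curve (weights now fixed)."""
        return forward(curve)[0]

    return compress, mse_before, mse()


def ramp(severity, onset, T=8):
    """Hypothetical fault curve: deviation ramps down to -severity."""
    return [-severity * min(1.0, t / onset) for t in range(T)]


curves = [ramp(s, o) for s in (0.1, 0.2, 0.3) for o in (2, 4, 6)]
compress, mse_before, mse_after = train_autoassociator(curves, n_hidden=2)
```

After training, only the input-to-hidden half of the network is needed:
`compress(curve)` is a single forward pass with fixed weights.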

Step 2: Hypothesis Generation 

Once training has been completed for the
autoassociative network above, the network is then
run (with fixed weights) on the entire set of desired
training data. Since no learning is involved at this
stage, each sensor data curve $\{s_{jt}\}$, $t = 1, \ldots, T$, can be
collapsed into its hidden-node representation
$\{s'_{jh}\}$, $h = 1, \ldots, H$, in one iteration of the network. The
H hidden node activations for each sensor constitute
a compressed representation of the input data for that
sensor. The entire set of training data can therefore
be converted into a compressed representation with
very little computation, i.e., only one iteration of the
network per data curve per sensor.

Each data curve will evoke a specific hidden
node response which can be represented as a single
point $(s'_{j1}, \ldots, s'_{jH})$ in the H-dimensional parameter
space of hidden node activations. Our studies have
shown that data curves generated by smoothly varying
the quantitative parameters $\{c_{ip}\}$ of a given fault
condition $C_i$ result in hidden node responses which
map out a surface. If H hidden nodes are used, the
hidden node activations will define a surface in an
H-dimensional parameter space. For example, Figure
2 illustrates surfaces generated by training an autoassociative
network, with three hidden nodes, on thrust
data from the Space Shuttle Main Engine. The larger
surface is generated from hidden node activations
which correspond to blockage at the main oxidizer
valve; the smaller surface corresponds to blockage of
the main fuel valve. Both surfaces represent fault
conditions over a range of severities and onset intervals.
The hidden-node activation resulting from normal
(steady-state) operation is represented as a single
point in the parameter space.

In the present example, sensor curve characteristics
are functions of two parameters: severity
and onset interval for each condition $C_i$. Curves which
vary by only one parameter $c_{i1}$ (e.g., main oxidizer
valve blockages of the same severity, but of different
onset intervals) will evoke corresponding hidden node
activations $R(F_j(c_{i1}))$ which define one coordinate
direction of the main oxidizer valve surface. That is,
$F_j(c_{i1})$ denotes the set of time-series data curves of
sensor $S_j$ when parameter $c_{i1}$ is varied. R in turn
compresses each curve into a point in the hidden-node
space. Curves which vary by another parameter $c_{i2}$
(e.g., main oxidizer valve blockages which vary in
severity, but are constant in onset time) will evoke
hidden node activations $R(F_j(c_{i2}))$ which define the
other coordinate direction of the surface. The entire
surface can therefore be mapped out by systematically
varying both $c_{i1}$ and $c_{i2}$. If $c_{i1} \times c_{i2}$ denotes the set of all
such combinations of values of the fault parameters
$c_{i1}$ and $c_{i2}$, then $F_j(c_{i1} \times c_{i2})$ denotes the set of
time-series curves for sensor $S_j$'s responses over this range
of parameters. $R(F_j(c_{i1} \times c_{i2}))$ in turn denotes the
surface obtained by compressing these sensor curves into
points in the H-dimensional hidden-node space.
Similarly, a set of main fuel valve blockages covering
a range of severities and onset intervals can be used
to define another surface.

FIGURE 2. Surfaces representing Main Oxidizer Valve and Main Fuel Valve blockages for Thrust sensor. The
autoassociative network maps each possible input pattern into one point (H1, H2, H3), representing the hidden
node activations which result from that input pattern. A class of related input patterns will generate a surface
in this space.

As explained above, the entire set of training
data can be converted into points in the hidden-node
parameter space in one iteration of the autoassociative
network per data curve per sensor. In the present
implementation, these points are simply interpolated
to generate a parameter surface for each fault condition.
"Training" the map M from hidden-node activations
to parameters consists merely of (i) generating
a point in the hidden-node parameter space for each
training example; (ii) labeling each such point with
the known parameters (e.g., severity and onset interval)
of the example which generated the point; and (iii)
interpolating a coordinatized surface from the set of
points obtained by varying the parameters of a given
condition.

Note that each different sensor $S_j$ will
generate its own set of surfaces in H-dimensional data
compression space, showing the responses of that
sensor to the range of conditions in the training data.
There will therefore be a separate map $M_j$ for each
sensor $S_j$.

This training process does not in any way
require that only three hidden nodes be used, but is
fully extensible to higher numbers of hidden nodes. If
higher numbers of hidden nodes are used,
higher-dimensional surfaces (hypersurfaces) will be
generated.
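
The three "training" steps for the map M can be sketched as below. The
text only says that the points are "simply interpolated", so the bilinear
scheme used here is an assumption, as are the function names and the toy
`toy_simulate` and `toy_compress` stand-ins (which replace the engine
simulator and the trained compression network for the sake of a
self-contained example).

```python
def build_labeled_points(compress, simulate, severities, onsets):
    """Steps (i) and (ii): one hidden-space point per training example,
    keyed by the (severity, onset) parameters that generated it."""
    return {(s, o): compress(simulate(s, o)) for s in severities for o in onsets}


def interpolate_surface(grid, severities, onsets, steps=4):
    """Step (iii): bilinear interpolation inside each grid cell.

    Returns a list of (params, point) pairs sampling the surface."""
    surface = []
    for s0, s1 in zip(severities, severities[1:]):
        for o0, o1 in zip(onsets, onsets[1:]):
            corners = (grid[(s0, o0)], grid[(s1, o0)],
                       grid[(s0, o1)], grid[(s1, o1)])
            for a in range(steps + 1):
                for b in range(steps + 1):
                    u, v = a / steps, b / steps
                    weights = ((1 - u) * (1 - v), u * (1 - v),
                               (1 - u) * v, u * v)
                    point = [sum(w * c[h] for w, c in zip(weights, corners))
                             for h in range(len(corners[0]))]
                    params = (s0 + u * (s1 - s0), o0 + v * (o1 - o0))
                    surface.append((params, point))
    return surface


# Toy stand-ins (hypothetical), chosen so the result is easy to check.
def toy_simulate(severity, onset):
    return [severity, onset, severity * onset]


def toy_compress(curve):
    return list(curve)


severities = [0.1, 0.2, 0.3]
onsets = [0.5, 1.0, 1.5]
grid = build_labeled_points(toy_compress, toy_simulate, severities, onsets)
surface = interpolate_surface(grid, severities, onsets)
```

Every sampled surface point carries the parameter values it was
interpolated at, which is what makes the surface "coordinatized": a point
on the surface can be read back as a severity and an onset interval.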

Once a network is trained and the surfaces
generated, it may be used for classification of new 
input patterns. In the example of Figure 2, a new input 
will be mapped into a new point in multi-dimensional 
space. If the point lies on or near the surface defined 
by the training examples, then the resulting 
hypothesis is that the fault condition represented by 
that surface is indeed occurring. The closer the new 
point is to the surface, the stronger is the evidence it 
provides for that hypothesis. If the new point is close 
to more than one surface, then more than one 
hypothesis will be generated, but with different levels 
of confidence if the distances to the surfaces are dif- 
ferent. 

After training is complete for steps 1 and 2
above, a new sensor curve $\{s_{jt}\}$, $t = 1, \ldots, T$, can be
converted into a new point $(s'_{j1}, \ldots, s'_{jH})$ in the hidden-node
parameter space, where $(s'_{j1}, \ldots, s'_{jH}) = R(\{s_{jt}\})$, R being
the mapping performed by the data compression network
of step 1. This point in turn can be projected onto
each surface $R(F_j(c_{i1} \times c_{i2}))$, yielding values for the
parameters $c_{i1}$ and $c_{i2}$ of the hypothesized condition
$C_i$. The degree of evidence for each hypothesized condition
is a function of the distance from the new point
to the surface representing that condition, as discussed
below. If N sensors are operating simultaneously,
then this process is carried out separately for each
sensor and each surface, projecting to the surfaces
that were generated for that particular sensor during
training. This will yield a different opinion from each
sensor as to the likelihood of various hypothesized
conditions. The purpose of the third component of our
decomposed architecture is to fuse the information
from different sensors into a reliable inference.
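
Projection onto a surface can be approximated by a nearest-neighbor
search over a sampled version of that surface. This is a sketch under that
assumption, not the projection method used in this work; the `project`
function and the three-point example surface are hypothetical.

```python
import math


def project(point, surface):
    """Return (distance, params) for the sampled surface point nearest to
    `point`. `surface` is a list of (params, hidden_space_point) pairs."""
    params, nearest = min(surface, key=lambda item: math.dist(point, item[1]))
    return math.dist(point, nearest), params


# Hypothetical sampled surface for one sensor and one fault condition:
# each entry pairs (severity, onset interval) with a hidden-space point.
surface = [
    ((0.1, 1.0), [0.0, 0.0]),
    ((0.2, 1.0), [1.0, 0.0]),
    ((0.3, 1.0), [2.0, 0.0]),
]

distance, params = project([1.1, 0.5], surface)
```

The returned parameters are the hypothesized severity and onset interval;
the returned distance is what the evidence for that hypothesis is based on.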

Step 3: Sensor Fusion 

The key to obtaining a reliable overall inference
is the reliability of the differential diagnoses
which can be contributed by each sensor. Figure 3
shows a typical problem of differential diagnosis. For
this sensor, the main oxidizer valve surface and the
main fuel valve surface intersect. The set of hidden-node
activations near this intersection are therefore
consistent with blockage of either the main oxidizer
valve or the main fuel valve. In this region of the
hidden-node space, differential diagnosis by this sensor
would be quite unreliable. On the other hand, for
points which are near one surface but not near to the
other surface, the data is consistent with only one
interpretation and therefore the differential diagnosis
is more reliable. Finally, points far from both surfaces
may indicate either an unknown condition or a faulty
sensor. These different possibilities may be represented
by the set of distances from a given point
$(s'_{j1}, \ldots, s'_{jH})$ in the hidden-node space to each of the
M surfaces $\{R(F_j(c_{i1} \times c_{i2}))\}$, $i = 1, \ldots, M$, in that space (as
well as the distance to the "normal" point $R(F_j(c_0))$
derived from sensor values under normal steady-state
conditions).

More specifically, let us define $d_{ij}$ to be the
distance from the point $(s'_{j1}, \ldots, s'_{jH})$ to the nearest
point on the surface $R(F_j(c_{i1} \times c_{i2}))$, or to the normal
point $R(F_j(c_0))$ when $i = 0$. Since the point $(s'_{j1}, \ldots, s'_{jH})$
was derived from sensor $S_j$'s data, and since the
surface $R(F_j(c_{i1} \times c_{i2}))$ gives the set of such points
predicted by hypothesis $C_i$, then $d_{ij}$ indicates how far
sensor $S_j$'s data is from the predictions of hypothesis
$C_i$. This distance can be turned into a consistency
measure by defining a tolerance $D_i$ for each surface,
derived from the variance observed in the set of training
data used to determine each surface
$R(F_j(c_{i1} \times c_{i2}))$. (A tolerance $D_0$ can also be defined
around the normal point $R(F_j(c_0))$.) A new data point
closer than this tolerance to the surface should be
taken as evidence in favor of the hypothesis, while a
data point farther away should be taken as evidence
against. We therefore take $D_i - d_{ij}$ as a measure of the
consistency of the data from sensor $S_j$ with hypothesis
$C_i$.
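
The consistency measure is a simple elementwise difference. In the sketch
below the function name and all numeric values are hypothetical; only the
formula $D_i - d_{ij}$ and its sign convention come from the text.

```python
def consistency(distances, tolerances):
    """distances[i][j] is d_ij (condition i, sensor j); tolerances[i] is D_i.

    Returns a matrix of D_i - d_ij: positive means sensor j's data lies
    within tolerance of hypothesis i, negative means it lies outside."""
    return [[tolerances[i] - d for d in row] for i, row in enumerate(distances)]


# Hypothetical example with M = 2 fault classes plus normal (row 0)
# and N = 2 sensors.
distances = [
    [0.50, 0.60],   # far from the normal point on both sensors
    [0.05, 0.10],   # close to the surface of fault class 1
    [0.40, 0.02],   # sensor 2 alone is also close to fault class 2
]
tolerances = [0.1, 0.2, 0.2]

x = consistency(distances, tolerances)
```

In this example both sensors support fault class 1, while the two sensors
disagree about fault class 2, which is exactly the kind of conflict the
sensor fusion stage has to arbitrate.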

Suppose, then, that these consistency
measures are represented as activations of input
nodes in a layered neural network to perform sensor
fusion, as in Figure 4. In general, such a network
would require $(M+1) \times N$ input nodes
$X_{ij}$, $i = 0, \ldots, M$, $j = 1, \ldots, N$; that is, one node $X_{ij}$ for each
pairing between M+1 hypothesized conditions and N
sensors (counting the steady-state condition as one
hypothesis). Each input node $X_{ij}$ of the sensor fusion
network receives a scalar input $D_i - d_{ij}$, where $D_i$ is the
tolerance associated with surface $R(F_j(c_{i1} \times c_{i2}))$ and
where $d_{ij}$ is the distance between the point
$(s'_{j1}, \ldots, s'_{jH}) = R(\{s_{jt}\})$ and the surface $R(F_j(c_{i1} \times c_{i2}))$.
(Recall that R is the data compression function performed
by the autoassociative network trained in Step
1 above, and $F_j(c_{i2})$ gives the series of points generated
by parameter $c_{i2}$ of condition $C_i$ during the training of
Step 2.) Each such input node activation $D_i - d_{ij}$ therefore
represents the consistency of data from sensor
$S_j$ with hypothesis $C_i$.

The desired output of the sensor fusion network
is the most likely hypothesis based on all sensor
data. The network therefore contains M+1 output
nodes, one node $Y_i$ for each possible hypothesis of a
fault condition $C_i$, $i = 1, \ldots, M$, and one node $Y_0$ for the
hypothesis of normal steady-state operation. Given
these sets of input nodes and output nodes, and
restricting ourselves to one layer of $(M+1)^2$ hidden
nodes, we can then define two possible architectures
which differ only in the connectivity from the input
nodes to the hidden nodes and from the hidden nodes
to the output nodes. These two architectures are
described below:

FIGURE 3. Surfaces representing Main Oxidizer Valve and Main Fuel Valve blockages for High Pressure Fuel
Turbopump Inlet Temperature sensor.

Fusion Architecture A is shown in Figure 4. A
structured connectivity pattern is defined which contains
$(M+1)^2$ hidden nodes $H'_{ik}$, $i = 0, \ldots, M$, $k = 0, \ldots, M$.
First let us consider the case where $i \neq k$. Each hidden
node $H'_{ik}$ has random excitatory connections
from the set of input nodes $\{X_{ij}\}$, $j = 1, \ldots, N$, and random
inhibitory connections from the set of input nodes
$\{X_{kj}\}$, $j = 1, \ldots, N$. Hidden node $H'_{ik}$ therefore calculates
a weighted sum of evidence from all sensors. In this
weighted sum, evidence in favor of hypothesis $C_i$
counts positively, but evidence against hypothesis $C_i$
counts negatively. Conversely, evidence in favor of
condition $C_k$ counts negatively in the weighted sum,
but evidence against hypothesis $C_k$ counts positively.
Hidden node $H'_{ik}$ is thus prewired to receive all data
relevant to a differential diagnosis of hypothesis $C_i$
over hypothesis $C_k$. (Conversely, hidden node $H'_{ki}$
would receive data relevant to a differential diagnosis
of hypothesis $C_k$ over hypothesis $C_i$.) For example, if
there are M = 2 classes of fault conditions, then there are
M(M+1) = 6 hidden nodes $H'_{ik}$ with $i \neq k$ to perform all
pairwise differential diagnoses between hypotheses
$C_i$ and $C_k$. These are illustrated as the upper 6 hidden
nodes in the middle layer of Figure 4.
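
The prewiring of the pairwise hidden nodes can be sketched as an initial
weight table. The flat input layout (index i*N + j for node $X_{ij}$), the
weight range, the seed, and the function name are assumptions for
illustration; the text specifies only that the connections from
$\{X_{ij}\}$ are randomly weighted and excitatory and those from
$\{X_{kj}\}$ randomly weighted and inhibitory.

```python
import random


def pairwise_hidden_weights(M, N, scale=0.1, seed=0):
    """Initial input weights for every pairwise node H'_ik with i != k.

    Returns {(i, k): row}, where row has (M+1)*N entries and entry
    i*N + j is the weight from input node X_ij."""
    rng = random.Random(seed)
    n_cond = M + 1                      # M fault classes plus normal
    weights = {}
    for i in range(n_cond):
        for k in range(n_cond):
            if i == k:
                continue
            row = [0.0] * (n_cond * N)
            for j in range(N):
                row[i * N + j] = rng.uniform(0.5 * scale, scale)    # excitatory
                row[k * N + j] = -rng.uniform(0.5 * scale, scale)   # inhibitory
            weights[(i, k)] = row
    return weights


def weighted_sum(row, x):
    return sum(w * xi for w, xi in zip(row, x))


M, N = 2, 2
weights = pairwise_hidden_weights(M, N)

# Hypothetical input: both sensors consistent with C_1, inconsistent with
# normal operation (C_0) and with C_2.
x = [-0.3, -0.3, 0.1, 0.1, -0.3, -0.3]
favor_1_over_2 = weighted_sum(weights[(1, 2)], x)
favor_2_over_1 = weighted_sum(weights[(2, 1)], x)
```

With this input the node for "C_1 over C_2" receives a positive net input
and the node for "C_2 over C_1" a negative one, before any training has
taken place.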

The M+1 remaining hidden nodes (designated
$H'_{ii}$ for convenience, $i = 0, \ldots, M$) are also prewired to
perform differential diagnoses, but in this case each
differential diagnosis $H'_{ii}$ is between (i) the hypothesis
that the current condition of the system is $C_i$, and (ii)
the hypothesis that the current condition of the system
belongs to some unknown class of anomalies
which has not been seen during training. (In the case
where M = 2, there would be 3 such hidden nodes,
illustrated as the lower 3 hidden nodes of Figure 4.)
In essence, each of these hidden nodes $H'_{ii}$ is attempting
to detect unknown anomalies in general, and more
specifically to differentiate the class of such unknown
anomalies from a particular known condition $C_i$. Since
by definition no training examples are available for
unknown anomalies, the hidden nodes $H'_{ii}$ are
prewired with inhibitory connections suitable for
detecting unknowns. Recall that the activity $D_i - d_{ij}$ of
each input node $X_{ij}$ is negative if the distance $d_{ij}$ from
the nearest known condition $C_i$ is greater than the
preset tolerance $D_i$. Therefore, a negative activation
in any input node should count as a contribution to the
evidence for an unknown anomaly. Moreover, a negative
activation in the particular input nodes $X_{ij}$ representing
distance from the known condition $C_i$ should
especially count as evidence to differentiate an unknown
anomaly from the known condition $C_i$. Each
hidden node $H'_{ii}$ for detecting unknown anomalies
should therefore have some inhibitory input from all
input nodes (as evidence for an unknown anomaly),
and stronger inhibitory weights from the particular
input nodes $X_{ij}$, $j = 1, \ldots, N$ (as evidence to differentially
diagnose an unknown anomaly from the condition $C_i$).
In Figure 4, dotted lines from input nodes to the lower
3 hidden nodes indicate connections which are initially
set to strongly inhibitory weights to perform each
differential diagnosis. As we have explained, there are
also weaker inhibitory connections (not shown in the
figure) from all other input nodes to each of the lower
3 hidden nodes.
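
The effect of this all-inhibitory prewiring can be checked with a small
sketch. The weight magnitudes (-0.05 and -0.5), the flat input layout
(index i*N + j for $X_{ij}$), and the example inputs are assumptions; the
text fixes only the signs and the weak-versus-strong distinction.

```python
import math


def unknown_detector_weights(M, N, weak=-0.05, strong=-0.5):
    """Initial input weights for the M+1 nodes H'_ii.

    Every input gets a weak inhibitory weight; the N inputs X_ij that
    belong to condition i get a strong inhibitory weight instead."""
    n_cond = M + 1
    rows = []
    for i in range(n_cond):
        row = [weak] * (n_cond * N)
        for j in range(N):
            row[i * N + j] = strong
        rows.append(row)
    return rows


def activate(row, x):
    """Sigmoid of the weighted sum of inputs."""
    net = sum(w * xi for w, xi in zip(row, x))
    return 1.0 / (1.0 + math.exp(-net))


M, N = 2, 2
rows = unknown_detector_weights(M, N)

# Hypothetical inputs D_i - d_ij: one far from every known condition,
# one within tolerance of condition C_1 on both sensors.
x_far = [-0.4] * 6
x_known = [-0.4, -0.4, 0.15, 0.15, -0.4, -0.4]

unknown_vs_c1_far = activate(rows[1], x_far)
unknown_vs_c1_known = activate(rows[1], x_known)
```

Because negative inputs multiplied by negative weights give a positive net
input, the node responds above 0.5 when the data lie outside every
tolerance, and drops below 0.5 when the data agree with its own condition.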

The weights leading to any given hidden node
define a discriminant function which is customarily
thought of as a hyperplane in the space of all possible
input vectors. All points (input vectors) on one side of
this hyperplane result in a positive activation of the
given hidden node; all points on the other side result
in a negative activation. This discrimination is sharpened
by the nonlinear sigmoid function applied to the
activation to yield the hidden node's output. The
weight changes of back propagation training in effect
move each hyperplane in a direction which lessens the
mean-squared error of the network's output over the
training set.

FIGURE 4. Architecture A: Prestructured to perform
differential diagnosis (all connections not shown).
Solid lines indicate connections initially set to
positive weights (before training begins); dotted
lines indicate connections initially set to negative
weights.


From this standpoint, the differential diagnosis
performed by each hidden node $H'_{ik}$, $k \neq i$, can be
thought of as a hyperplane intended to separate the
data points representing training examples of condition
$C_i$ from those representing training examples of
condition $C_k$. While the initial setting of weights biases
each hidden node $H'_{ik}$ to perform this differential
diagnosis, back propagation will move this hyperplane
in whichever direction minimizes the mean-squared
error of the network over the set of training examples.


Similarly, the role of each hidden node $H'_{ii}$ can
be viewed as a hyperplane intended to distinguish all
known conditions $C_i$ on one side, from unknown
anomalies on the other side of the hyperplane. M+1
such hyperplanes are created by the hidden nodes
$H'_{ii}$, $i = 0, \ldots, M$. These M+1 hyperplanes are initially
placed in different positions due to the stronger inhibitory
weights assigned to the inputs $X_{ij}$, $j = 1, \ldots, N$,
than to the other inputs by each particular hidden
node $H'_{ii}$, as explained above. Since no training examples
are available for unknown anomalies, back
propagation might conceivably reduce the number of
known training examples erroneously classified as
unknown, but would not be expected to improve the
recognition of unknown anomalies as such. We
hypothesized, however, that the effect of this inherent
limitation of example-based training would be lessened
by the prestructuring (initial setting of connections
and weights) of architecture A described above.

Finally, each output node Yi receives initially
excitatory connections from the hidden nodes
{H'ik}, k ≠ i, initially inhibitory connections
from the hidden nodes {H'ki}, k ≠ i, and an initially
inhibitory connection from the hidden node H'ii.
Each output node Yi thus receives excitatory
input from all differential diagnoses favoring
hypothesis Ci, and inhibitory input from all
differential diagnoses opposing hypothesis Ci.
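
The sign pattern of the initial hidden-to-output connections can be sketched as follows. This is an illustration under stated assumptions: the function name, the common weight magnitude of 1.0, and the choice to leave all other hidden nodes unconnected to Yi are hypothetical, since the text specifies only the signs of the connections.

```python
def initial_output_weights(num_conditions, magnitude=1.0):
    """Return {i: {(row, col): weight}} giving, for each output node Yi,
    the initial weight from hidden node H'(row, col).

    Only the signs follow the text; the magnitude is an assumed value."""
    weights = {}
    for i in range(num_conditions):
        w_i = {}
        for k in range(num_conditions):
            if k == i:
                w_i[(i, i)] = -magnitude   # H'ii inhibits Yi
            else:
                w_i[(i, k)] = +magnitude   # H'ik favors Ci over Ck
                w_i[(k, i)] = -magnitude   # H'ki favors Ck over Ci
        weights[i] = w_i
    return weights

# Three conditions (for example normal plus two faults).
w = initial_output_weights(3)
for (row, col), value in sorted(w[0].items()):
    print("Y0 <- H'%d%d: %+.1f" % (row, col, value))
```

With three conditions, each output node starts with two excitatory and three inhibitory connections.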

Fusion Architecture B is shown in Figure 5. 
This architecture has the same number of hidden 
nodes as architecture A, and the same number of input 
and output connections per hidden node, but with 
random connectivity from input to hidden and hidden 
to output layers, and random assignment of initial 
weights. This architecture is intended to serve as a 
"control" case against which the effects of the initial 
structuring of architecture A can be evaluated. 

FIGURE 5. Architecture B: Fully connected with
random initial weights (not all connections shown).
All connections are initially set to random weights
drawn from a uniform distribution between -0.1 and
0.1.
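
A minimal sketch of the random initialization used for the control case is given below. The layer sizes, the function name, and the seed are hypothetical; only the uniform range of -0.1 to 0.1 comes from the description above.

```python
import random

def random_initial_weights(n_rows, n_cols, low=-0.1, high=0.1, seed=0):
    """Weight matrix with entries drawn uniformly between low and high."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [[rng.uniform(low, high) for _ in range(n_cols)]
            for _ in range(n_rows)]

# Hypothetical sizes: 4 input nodes feeding 9 hidden nodes.
w_input_to_hidden = random_initial_weights(4, 9)
print(min(map(min, w_input_to_hidden)), max(map(max, w_input_to_hidden)))
```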


Both architectures for sensor fusion were 
trained with the same back-propagation algorithm, 
using the outputs of the training data that were used 
in step 2 to determine the surfaces 
{ft(Fy(ciiXCt2))}, i = 1,-^ for each sensor j. Our 
hypothesis was that the back-propagation algorithm
constrained to the connectivity of architecture A
would (i) produce a set of weights from input nodes to
hidden nodes which allows the hidden nodes to perform
pairwise differential diagnoses; (ii) reliably
differentiate unknown engine conditions (not present
during training) from the known classes of engine
conditions present during training (a capability
expected to be lacking in the unstructured
back-propagation architecture B); and (iii) accomplish
this without sacrificing diagnostic performance for
known engine conditions, in comparison to
the unstructured back-propagation architecture.

Testing Procedure 

The decomposed architecture (performing 
data compression, hypothesis generation, and sensor 
fusion) was trained as described in steps 1-3 above on 
simulated SSME data for the four sensors illustrated 
in Figure 1: high pressure fuel turbopump tempera- 
ture, engine thrust, chamber coolant valve pressure, 
and main fuel valve pressure. The training set con- 
sisted of normal data and two fault conditions, main 
oxidizer valve (MOV) blockage and main fuel valve 
(MFV) blockage. Each fault condition was included in 
the training set at three different levels of severity,
and three different onset intervals for each level of
severity. In order to compare the performance of sen- 
sor fusion architectures A and B, they were each 
trained on the identical output of Steps 1 and 2 of the 
training procedure. 

After training was completed, the performance
of architectures A and B was compared on test
data containing the 3 "known" conditions used in
training (normal, MOV blockage, and MFV blockage) 
and 2 additional fault conditions that were not 
presented during training, Oxidizer Preburner Valve 
(OPV) blockage and Fuel Preburner Valve (FPV) 
blockage. These two additional "unknown" conditions 
were included in order to test each architecture’s 
ability to detect fault conditions which had not been 
included in the training set. 600 instances of each 
condition were generated which differed only in the 
amount of noise added to the simulated data. For each 
noise level, 100 instances of each condition were 
generated with the inclusion of random noise at that 
noise level. During testing each example was clas- 
sified as one of the three training conditions (normal, 
MOV blockage, or MFV blockage) based on the maxi- 
mally active output node, or as "unknown" if none of 
the output nodes was activated above its threshold 
value. Each architecture (A and B) was tested using a 
range of different thresholds, and the threshold yield- 
ing the best performance for that network was used. 
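
The decision rule and threshold selection described above can be sketched as follows. The output activations are made-up numbers, and the use of a single threshold shared by all output nodes and searched over a small candidate list is an assumption; the text does not state how the range of thresholds was explored.

```python
LABELS = ["normal", "MOV blockage", "MFV blockage"]

def classify(outputs, thresholds):
    """Label of the maximally active output node, or 'unknown' if no
    output node is activated above its threshold."""
    if not any(o > t for o, t in zip(outputs, thresholds)):
        return "unknown"
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return LABELS[best]

def percent_correct(examples, thresholds):
    """Percentage of (outputs, true_label) pairs classified correctly."""
    hits = sum(classify(out, thresholds) == truth for out, truth in examples)
    return 100.0 * hits / len(examples)

def best_threshold(examples, candidates):
    """Assumed search: try one shared threshold at a time, keep the best."""
    return max(candidates,
               key=lambda t: percent_correct(examples, [t] * len(LABELS)))

# Made-up output activations, for illustration only.
examples = [
    ([0.9, 0.1, 0.2], "normal"),
    ([0.2, 0.8, 0.1], "MOV blockage"),
    ([0.1, 0.3, 0.7], "MFV blockage"),
    ([0.3, 0.2, 0.4], "unknown"),
]
t = best_threshold(examples, [0.1, 0.5, 0.95])
print(t, percent_correct(examples, [t] * len(LABELS)))
```

In this toy data a threshold that is too low forces the unknown example into a known class, while one that is too high labels everything as unknown.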

Table I. Percent correct classification at various noise
levels performed by prestructured architecture A.

                        Noise Level (%)
Fault               0.0    0.5    1.0    1.5    2.0
---------------------------------------------------
Normal (Known)      100    100     95     81     70
MOV (Known)         100    100    100    100    100
MFV (Known)         100     97     86     85     80
OPV (Unknown)       100    100    100    100    100
FPV (Unknown)       100    100    100    100    100


Table II. Percent correct classification at various noise
levels performed by fully connected architecture B.

                        Noise Level (%)
Fault               0.0    0.5    1.0    1.5    2.0
---------------------------------------------------
Normal (Known)      100    100     94     76     69
MOV (Known)         100    100    100    100     98
MFV (Known)         100     94     86     79     79
OPV (Unknown)         0      0      0      1      5
FPV (Unknown)         0      0      0      0      0


Results and Discussion 

Table I shows the results of testing sensor 
fusion architecture A. As explained above, this neural 
network architecture was structured prior to training 
to perform differential diagnoses among the different 
conditions to be presented during training, and be- 
tween each of these conditions and the class of un- 
known faults not presented during training. Table II 
shows the results of testing architecture B using the 
same training data and same testing data. Architecture
B differed from architecture A in that the connection
weights were not structured prior to training, but
rather were set to random initial values, as is typically
done in neural networks.

Each column in Tables I and II shows the 
results obtained for a given noise level in the test data. 
Within each column, classification performance (percent
correct classifications) is shown separately for the
normal operation condition and for each fault condition.

The first three rows of each table show the 
performance on test data representing the three 
known conditions (normal and two types of faults) that 
were presented during training. The performance of 
both architectures declined with the addition of
greater levels of noise to the test data, with architecture
A performing very slightly better at the higher
noise levels. As expected, the two architectures did
not differ greatly in their classification performance 
on these known conditions. This is consistent with our 
hypothesis that the initial structuring would not 
detract from the network’s ability to correctly classify 
new examples of the same conditions presented 
during training. 
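
The size of the difference on known conditions can be checked directly from the first three rows of Tables I and II. The sketch below averages the three known conditions at each noise level; the unweighted mean is a convenience chosen here, justified only by each condition having the same number of test instances per noise level.

```python
NOISE_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0]

# Percent correct on known conditions, copied from Tables I and II.
TABLE_A = {
    "Normal": [100, 100, 95, 81, 70],
    "MOV":    [100, 100, 100, 100, 100],
    "MFV":    [100, 97, 86, 85, 80],
}
TABLE_B = {
    "Normal": [100, 100, 94, 76, 69],
    "MOV":    [100, 100, 100, 100, 98],
    "MFV":    [100, 94, 86, 79, 79],
}

def mean_by_noise(table):
    """Unweighted mean over conditions, one value per noise level."""
    return [sum(column) / len(column) for column in zip(*table.values())]

mean_a = mean_by_noise(TABLE_A)
mean_b = mean_by_noise(TABLE_B)
advantage = [a - b for a, b in zip(mean_a, mean_b)]

for level, a, b, d in zip(NOISE_LEVELS, mean_a, mean_b, advantage):
    print("noise %.1f%%: A = %.1f, B = %.1f, A - B = %+.1f" % (level, a, b, d))
```

On this summary, architecture A is never worse than architecture B, and the gap stays under four percentage points at every noise level.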

The last two rows of each table show the 
results of testing with the two unknown fault condi- 
tions which had not been presented during training. 
Since a conventional back propagation network learns 
to classify based on its training examples only, we 
expected architecture B not to be able to recognize new 
fault conditions as unknown, but rather to classify 
them into one of the classes that had been presented 
during training. This is indeed what happened. As 
shown in the last two rows of Table II, architecture B
correctly classified five percent or fewer of the examples
of these two unknown faults at each noise level. Recall that
results are shown for the best threshold setting for 
each network. In other words, there was no threshold 
setting which would allow the output nodes of ar- 
chitecture B to differentiate known examples from 
unknown examples on the basis of its output node 
activity. 

Architecture A, by contrast, was able to correctly
identify new faults as unknown, based on below-threshold
activity in all output nodes. This is shown
in the last two rows of Table I, in which all examples
of new faults at all noise levels are correctly classified 
on this basis. Since the only difference between ar- 
chitectures A and B is in the initial structuring of 
connections prior to training, it is reasonable to con- 
clude that the initial structuring of architecture A to 
perform differential diagnoses allows this network to 
detect examples of unknown classes that were not
presented during training. 

The structured back-propagation network of 
architecture A could be viewed as a hybrid between a 
knowledge-based expert system and an example- 
trained neural network. In knowledge-based systems, 
both the general format of the rules and the exact 
instantiations of the rules are extracted from human 
experts. In conventional back-propagation networks, 
hidden nodes serve a function analogous to rules in an 
expert system. The general format of such a "hidden- 
node rule" is determined by which input nodes have 
significant connections to each hidden node, while the 
exact instantiation of such a rule is given by the exact 
weights which result from training. In conventional 
back-propagation, both the format of the hidden-node 
rules and the exact instantiation of these rules are 
implicitly determined from training examples by the 
back-propagation algorithm. The structured back- 
propagation architecture A above represents a hybrid 
between these two approaches. The general format of 
its "hidden-node rules" is determined in advance by 
the connectivity specified for architecture A, and is 
based on expert judgment about the general utility of 
rules based on differential diagnosis. Within this 
general format, however, the specific instantiation of 
each differential diagnosis rule is determined by the 
exact weights which are learned from training ex- 
amples via back propagation. This initial structuring 
of the back-propagation architecture using expert 
human judgment allows the neural network to detect 
the occurrence of faults for which no training ex- 
amples were presented. This dependence on expert 
human judgment is much less than in a rule-based 
expert system, however, since the exact instantiations 
of the rules are still learned from training examples. 
Training based on examples should make these
"hidden-node rules" easier to maintain than the rules
of a conventional rule-based expert system. However, this
reduced dependence on expert judgment, together with the
implicit learning of the "hidden-node rules", makes it
more difficult to provide explanations to the user about
the inference process. Nevertheless, forcing the
"hidden-node rules" into a predefined format allows the
behavior of the network to be more easily compared with
the behavior of human experts than in the case of an
unstructured back-propagation network. This potentially
would allow the network to "explain" the reasons for its
diagnosis in terms understandable by a human
operator.